Skip to content

[SPARK-54405][SQL][Metric View] CREATE command and SELECT query resolution#53158

Closed
linhongliu-db wants to merge 17 commits intoapache:masterfrom
linhongliu-db:metric-view-create-and-select
Closed

[SPARK-54405][SQL][Metric View] CREATE command and SELECT query resolution#53158
linhongliu-db wants to merge 17 commits intoapache:masterfrom
linhongliu-db:metric-view-create-and-select

Conversation

@linhongliu-db
Copy link
Contributor

@linhongliu-db linhongliu-db commented Nov 21, 2025

What changes were proposed in this pull request?

This PR implements the command to create metric views and the analysis rule to resolve a metric view query:

  • CREATE Metric view
    • Add SQL grammar to support WITH METRIC when creating a view
    • Add dollar-quoted string support for YAML definitions
    • Implement CreateMetricViewCommand to analyze the view body
    • Use a table property to indicate that the View is a metric view since HIVE has no dedicated table type
  • SELECT Metric view
    • Update SessionCatalog to parse metric view definitions on read
    • Add MetricViewPlanner utility to parse the YAML definition and construct an unresolved plan
    • Add ResolveMetricView rule to substitute the dimensions and measures reference to actual expressions

NOTE: This PR depends on #53146

This PR also marks org.apache.spark.sql.metricview as an internal package

Why are the changes needed?

SPIP: Metrics & semantic modeling in Spark

Does this PR introduce any user-facing change?

No

How was this patch tested?

build/sbt "hive/testOnly  org.apache.spark.sql.execution.SimpleMetricViewSuite"
build/sbt "hive/testOnly  org.apache.spark.sql.hive.execution.HiveMetricViewSuite"

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions github-actions bot added the SQL label Nov 21, 2025
@linhongliu-db
Copy link
Contributor Author

cc @cloud-fan to review


mode DOLLAR_QUOTED_STRING_MODE;

DOLLAR_QUOTED_STRING_BODY
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wrong indentation?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

val name = child match {
case v: ResolvedIdentifier =>
v.identifier.asTableIdentifier
case _ => throw QueryCompilationErrors.loadDataNotSupportedForV2TablesError()
Copy link
Contributor

@cloud-fan cloud-fan Dec 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should not happen, right? CheckAnalysis should fail earlier. We can throw internal error here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you are right, removed

  - Add METRICS keyword to lexer
  - Add dollar-quoted string support for YAML definitions
  - Add createMetricView production rule
- Add METRIC_VIEW catalog table type
- Implement CreateMetricViewCommand:
  - Parse YAML definition
  - Build MetricViewPlaceholder logical node
  - Validate and analyze metric view
- Add MetricViewPlaceholder logical node with tree patterns
- Update ViewHelper to support metric view creation
- Add basic test suite for metric views

- Add MetricViewPlanner utility:
  - planRead() to parse metric view for SELECT queries
  - planWrite() refactored from metricViewCommands
  - parseYAML() shared parsing logic
- Add ResolveMetricView analyzer rule:
  - Transform MetricViewPlaceholder into aggregation queries
  - Parse dimensions and measures from schema metadata
  - Build Project with dimensions and Aggregate with measures
  - Handle measure references in aggregates
- Update SessionCatalog to parse metric view definitions on read
- Update EliminateView to handle ResolvedMetricView nodes
- Refactor CreateMetricViewCommand to use MetricViewPlanner
- Update ViewHelper to set METRIC_VIEW table type correctly
- Add ResolveMetricView to analyzer rule chain
- Update test suite with query tests

update

test

test
@linhongliu-db linhongliu-db force-pushed the metric-view-create-and-select branch from 0cbded9 to 9d97100 Compare December 10, 2025 19:10

override def prettyName: String = getTagValue(FunctionRegistry.FUNC_ALIAS).getOrElse("measure")

override def nullable: Boolean = true
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should it be child.nullable?

}
}

override def visitCodeLiteral(ctx: CodeLiteralContext): String = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shall we put this in AstBuilder?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done


if (ctx.METRICS(0) == null) {
throw QueryParsingErrors.missingClausesForOperation(
ctx, "WITH METRICS", "CREATE METRIC VIEW")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bit misleading as we don't really support the syntax CREATE METRIC VIEW, shall we just say metric view creation?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done


if (ctx.routineLanguage(0) == null) {
throw QueryParsingErrors.missingClausesForOperation(
ctx, "LANGUAGE", "CREATE METRIC VIEW")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

}

val languageCtx = ctx.routineLanguage(0)
withOrigin(languageCtx) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need to do this?

.getOrElse(Map.empty)
val codeLiteral = visitCodeLiteral(ctx.codeLiteral())

withIdentClause(ctx.identifierReference(), ident => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's simpler to do

      CreateMetricViewCommand(
        withIdentClause(...),
        userSpecifiedColumns,
        visitCommentSpecList(ctx.commentSpec()),
        properties,
        codeLiteral,
        allowExisting = ctx.EXISTS != null,
        replace = ctx.REPLACE != null
      )

* groupingExpressions = [region],
* aggregateExpressions = [region, sum(amount), avg(price)],
* child = Filter(upper(region) = 'REGION_1',
* Filter(product = 'product_1', sales_table))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

where is the aforementioned Project?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added. and fixed the upper filter expression to use dimension AttributeReference.

import org.apache.spark.sql.connector.catalog.CatalogV2Implicits._

override val output: Seq[Attribute] = Seq(
AttributeReference("result", StringType, nullable = false)()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we have any output?

@github-actions github-actions bot added the BUILD label Dec 11, 2025
* 2. Load and parse the stored metric view definition from catalog metadata
* 3. Build a [[Project]] node that:
* - Projects dimension expressions: [region, upper(region) AS region_upper]
* - Includes non-conflicting source columns for filters
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but in the example, we filter by region_upper. I think the main reason is for the measure agg functions to reference columns?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, that's right. I updated the comment to make it more clear

// 3. metric view output should use the same exprId
val sourceProjList = sourceOutput.filterNot { attr =>
// conflict with dimensions
metricView.outputMetrics
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

outputMetrics contains both dimension and measure columns, shall we filter out dimension columns first before we look up the column name?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

if (attr.metadata.contains(MetricViewConstants.COLUMN_TYPE_PROPERTY_KEY)) {
// no alias for metric view column since the measure reference needs to use the
// measure column in MetricViewPlaceholder, but an alias will change the exprId
attr
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

then will it have issues with DeduplicateRelation?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a test case to verify that union two same metric view query will work well.
In the test, before DeduplicateRelation, the plan is below

Union false, false
:- Aggregate [region#1903], [region#1903, sum(count#1910) AS total_count#1901L, avg(price#1911) AS avg_price#1902]
:  +- ResolvedMetricView `spark_catalog`.`default`.`test_metric_view`
:     +- Project [count#23 AS count#1910, price#24 AS price#1911, cast(region#21 as string) AS region#1903, cast(product#22 as string) AS product#1904, cast(upper(region#21) as string) AS region_upper#1905]
:        +- SubqueryAlias spark_catalog.default.test_table
:           +- Relation spark_catalog.default.test_table[region#21,product#22,count#23,price#24] parquet
+- Aggregate [region#1903], [region#1903, sum(count#1910) AS total_count#1901L, avg(price#1911) AS avg_price#1902]
   +- ResolvedMetricView `spark_catalog`.`default`.`test_metric_view`
      +- Project [count#23 AS count#1910, price#24 AS price#1911, cast(region#21 as string) AS region#1903, cast(product#22 as string) AS product#1904, cast(upper(region#21) as string) AS region_upper#1905]
         +- SubqueryAlias spark_catalog.default.test_table
            +- Relation spark_catalog.default.test_table[region#21,product#22,count#23,price#24] parquet

after DeduplicateRelation, the plan is below

region: string, total_count: bigint, avg_price: double
Union false, false
:- Aggregate [region#1903], [region#1903, sum(count#1910) AS total_count#1901L, avg(price#1911) AS avg_price#1902]
:  +- ResolvedMetricView `spark_catalog`.`default`.`test_metric_view`
:     +- Project [count#23 AS count#1910, price#24 AS price#1911, cast(region#21 as string) AS region#1903, cast(product#22 as string) AS product#1904, cast(upper(region#21) as string) AS region_upper#1905]
:        +- SubqueryAlias spark_catalog.default.test_table
:           +- Relation spark_catalog.default.test_table[region#21,product#22,count#23,price#24] parquet
+- Aggregate [region#1920], [region#1920, sum(count#1918) AS total_count#1923L, avg(price#1919) AS avg_price#1924]
   +- ResolvedMetricView `spark_catalog`.`default`.`test_metric_view`
      +- Project [count#1916 AS count#1918, price#1917 AS price#1919, cast(region#1914 as string) AS region#1920, cast(product#1915 as string) AS product#1921, cast(upper(region#1914) as string) AS region_upper#1922]
         +- SubqueryAlias spark_catalog.default.test_table
            +- Relation spark_catalog.default.test_table[region#1914,product#1915,count#1916,price#1917] parquet

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

then why do we need to add alias at all?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not needed in this PR actually. But to support group by gropuing sets, there will be a Expand + Project between Aggregate and ResolvedMetricView node, and I found without this alias, DeduplicateRelation will failed to update the exprId for the Expand + Project and lead to dangling references.

}.map { attr =>
if (attr.metadata.contains(MetricViewConstants.COLUMN_TYPE_PROPERTY_KEY)) {
// no alias for metric view column since the measure reference needs to use the
// measure column in MetricViewPlaceholder, but an alias will change the exprId
Copy link
Contributor

@cloud-fan cloud-fan Dec 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MetricViewPlaceholder.output is fully decoupled with its child. To not change the exprId we need to add an alias to retain the original exprId, and you are doing the opposite. The Project should be constructed like this:

Project(
  Seq(
    Alias(source_attr_1, name)(exprId = output_metric_1.exprId),
    ...
  ),
  source
)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh actually here we are adding new columns not in the outputMetrics, why the exprId matters?

@cloud-fan
Copy link
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in b7fda7c Dec 17, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants